Here is the short version: to get better performance on your
SSL
terminator, use
stud on 64bit system with patch from
meric Brun for SSL session reuse with some
AES cipher
suite (128 or 256, does not really matter), without
DHE, on as many
cores as needed, a key size of 1024 bits unless more is needed.
Introduction
A quick note: when I say SSL, you should read TLS v1
instead. Look at the article for TLS on Wikipedia for some
background on this.
One year ago, Adam Langley, from Google,
stated SSL was not computationally expensive any more:
In January this year (2010), Gmail switched to using HTTPS for
everything by default. Previously it had been introduced as an option,
but now all of our users use HTTPS to secure their email between their
browsers and Google, all the time. In order to do this we had to
deploy no additional machines and no special hardware. On our
production frontend machines, SSL/TLS accounts for less than 1% of the
CPU load, less than 10KB of memory per connection and less than 2% of
network overhead. Many people believe that SSL takes a lot of CPU time
and we hope the above numbers (public for the first time) will help to
dispel that.
If you stop reading now you only need to remember one thing: SSL/TLS
is not computationally expensive any more.
This is a very informative post containing tips on how to enhance SSL
performance by reducing latency. However, unlike Gmail, you may still
be worried by SSL raw performances. Maybe each of your frontends is
able to serve 2000 requests per second and CPU overhead for SSL is
therefore significative. Maybe you want to terminate SSL in your load
balancer (even if you know
this is not the best way to scale).
Tuning SSL
There are a lot of knobs you can use to get better performances from
SSL: choosing the best implementation, using more CPU cores,
switching to 64bit system, choosing the right cipher suite and
the appropriate key size and enabling a session cache.
We will consider three SSL terminators. They all use OpenSSL behind
the hood. stunnel is the oldest one and uses a threaded
model. stud is a recent attempt to write a simple SSL
terminator which is efficient and scalable. It uses the
one-process-per-core model. nginx is a web server and it
can be used as reverse proxy and therefore act as SSL terminator. It
is known to be one of the most efficient web server, hence the choice
here. It also features built-in basic load balancing. Since stud and
stunnel does not have this feature, we use them with
HAProxy, an high performance load-balancer that usually
defers the SSL part to stunnel (but stud can act as a drop-in
replacement here).
Those implementations were already tested by
Matt Stancliff. He first concluded
nginx sucked at SSL then that
nginx did not suck at SSL (in the first case, nginx was the
only one to select a DHE cipher suite). The details were very
scarce. I hope to bring here more details.
This brings the importance of the selected cipher suite. The client
and the server must agree on the cipher suite to use for symmetric
encryption. The server will select the strongest one it supports from
the list proposed by the client. If you enable some expensive cipher
suite on your server, it is likely to be selected.
The last important knob is SSL session reuse. It matters for both
performance and latency. During the first handshake, the server will
send a session ID. The client can use this session ID to request an
abbreviated handshake. This handshake is shorter (latency
improvement by one round-trip) and allows to reuse the previously
negociated master secret (performance improvement by skipping the
costliest part of the handshake). See the
article from Adam Langley for additional information (with
helpful drawings).
The following table shows what cipher suite is selected by some major
web sites. I have also indicated wheter they support session
reuse. See this
post from Matt Palmer to know how to check this.
Site |
Cipher |
Session reuse? |
www.google.com |
RC4-SHA |
|
www.facebook.com |
RC4-MD5 |
|
twitter.com |
AES256-SHA |
|
windowsupdate.microsoft.com |
RC4-MD5 |
|
www.paypal.com |
AES256-SHA |
|
www.cmcicpaiement.fr |
DHE-RSA-AES256-SHA |
|
RFC 5077 defines another mechanism for session resumption
that does not require any server-side state. I did not investigate
much how this works but it does not rely on the session
ID.
Benchmarks
All those benchmarks have been run on a HP DL 380 G7, with two
Xeon L5630 (running at 2.13GHz for a total of 8 cores),
without hyperthreading, using a 2.6.39 kernel (HZ
is set to 250) and
two Intel 82576 NIC. Linux conntrack subsystem has been
disabled and file limits have been raised over 100,000.
A Spirent Avalanche 2900 appliance is used to run the benchmarks.
We are interested in raw SSL performances in term of handshakes per
second. Because we want to observe the effects of session reuse, the
scenario runs by most tests lets clients doing four successive
requests and reusing the SSL session for three of them. There is no
HTTP keepalive. There is no compression. The size of the HTTP answer
from the server is 1024 bytes. Those choices are done because our
primary target is the number of SSL handshakes per second.
It should also be noted that HAProxy alone is able to handle 22,000
TPS on one core. During the tests, it was never the bottleneck.
Implementation
We run a first bench to compare nginx, stud and stunnel on one
core. This bench runs on a 32bit system, with AES128-SHA1 cipher suite
and a 1024bit key size. Here is the result for stud (version 0.1,
with meric Brun s patch for SSL session reuse):
The most important plot is the top one. The blue line is the attempted
number of transactions per second (TPS) while the green one is the
successful number of TPS. When the two lines start to diverge, this
means we reach some kind of maximum: the number of unsuccessful TPS
also starts to raise (the red line).
There are several noticeable points: the maximum TPS (788 TPS), the
maximum TPS with an average response time less than 100 ms (766 TPS),
less than 500 ms (776 TPS) and the maximum TPS with less than 0.01% of
packet loss (783 TPS).
Let s have a look at nginx (version 1.0.5):
It is able to get the same performance as stud (763 TPS). However,
over 512 TPS, you get more than 100 ms of response time. Over 556 TPS,
you even get more than 500 ms! I get this behaviour in every bench
with nginx (with or without proxy_buffering
, with or without
tcp_nopush
, with or without tcp_nodelay
) and I am unable to
explain it. Maybe there is
something wrong in the configuration. After hitting this
limit, nginx starts to process connections by bursts. Therefore, the
number of successful TPS is plotted using a dotted line while the
moving average over 20 seconds is plotted with a solid line.
On the next plot, you can see the compared performance of the three
implementations. On this kind of plot, the number of TPS that we keep
is the maximum number of TPS where loss is less than 0.1% and average
response time is less than 100 ms. stud achieves a performance of
766 TPS while nginx and stunnel are just above 500 TPS.
Number of cores
With multiple cores, we can affect several of them to do SSL hard
work. To get better performance, we pin each processes to a dedicated
CPU. Here is the repartition used during the tests:
|
1 core |
2 cores |
4 cores |
6 cores |
CPU 1, Core 1 |
network |
network |
network |
network + haproxy |
CPU 1, Core 2 |
haproxy |
haproxy |
haproxy |
SSL |
CPU 1, Core 3 |
SSL |
SSL |
SSL |
SSL |
CPU 1, Core 4 |
- |
SSL |
SSL |
SSL |
CPU 2, Core 5 |
- |
- |
network |
SSL |
CPU 2, Core 6 |
- |
- |
SSL |
SSL |
CPU 2, Core 7 |
- |
- |
SSL |
SSL |
CPU 2, Core 8 |
system |
system |
system |
system + haproxy |
Remember, we have two
CPU, 4 cores on each
CPU. Cores on the same
CPU
share the same L2 cache (on this model). Therefore, the arrangement
inside a
CPU is not really important. When possible, we try to keep
things together on the same
CPU. The
SSL processes always get
exclusive use of a core since they will always be the most busy. The
repartition is done with
cpuset(7) for userland processes
and by setting
smp_affinity
for network card interrupts.
We keep one core for the system. It is quite important to be able to
connect and monitor the system correctly even when it is loaded. With
so many cores, we can afford to reserve one core for this usage.
Beware! There is a trap when pining processes or
IRQ to a core. You
need to check
/proc/cpuinfo
to discover the mapping between kernel
processors and physical cores. Sometimes, second kernel processor can
be the first core of the second physical processor (instead of the
second core of the first physical processor).
As you can see in the plot above, only
stud is able to scale
properly.
stunnel is not able to take advantage of the cores and its
performances are worse than with a single core. I think this may be
due to its threaded model and the fact that the userland for this
32bit system is a bit old.
nginx is able to achieve the same
TPS
than
stud but is disqualified because of the increased latency. We
won t use
stunnel in the remaining tests.
32bit vs 64bit
The easiest way to achieve a performance boost is to switch to a 64bit
system. TPS are doubled. On our 64bit system, thanks to the use of a
version of OpenSSL with the appropriate support, AES-NI, a CPU
extension to improve the speed of AES encryption/decryption, is
enabled. Remaining tests are done with a 64bit system.
Ciphers and key sizes
In our tests, the influence of ciphers is minimal. There is almost no
difference between AES256 and AES128. Using RC4 adds some latency. The
use of AES-NI may have helped to avoid this latency for AES.
On the other hand, using a 2048bit key size has a huge performance
hit. TPS are divided by 5.
Session cache
The last thing to check is the influence of SSL session reuse. Since
we are on a local network, we see only one of the positive effects of
session reuse: better TPS because of reduced CPU usages. If there was
a latency penalty, we should also have seen better
TPS thanks to the removed round-trip.
This plot also explains why stud performances fall after the
maximum: because of the failed transactions, the session cache is not
as efficient. This phenomenon does not exist when session reuse is not enabled.
Conclusion
Here is a summary of the maximum TPS reached during the benchmark
(with an average response time below 100ms). Our control use case is
the following: 64bit, 6 cores, AES128, SHA1, 1024bit key, 4 requests
into the same SSL session.
Context |
nginx 1.0.5 |
stunnel 4.41 |
stud 0.1 (patched) |
1 core, 32bit |
512 TPS |
503 TPS |
766 TPS |
2 cores, 32bit |
599 TPS |
- |
- |
32bit |
804 TPS |
501 TPS |
4251 TPS |
- |
1799 TPS |
- |
9000 TPS |
AES256 |
- |
- |
8880 TPS |
RC4-MD5 |
- |
- |
7370 TPS |
2048bit |
- |
- |
1643 TPS |
no session reuse |
- |
- |
5844 TPS |
80 requests per SSL session |
- |
- |
10797 TPS |
Therefore,
stud in our control scenario is able to sustain 1500
TPS
per core. It seems to be the best current option for a
SSL
termination. It is not available in Debian yet, but I
intend to package it.
Here is how
stud is invoked:
# ulimit -n 100000
# stud -n 2 -f 172.31.200.15,443 -b 127.0.0.1,80 -c ALL \
> -B 1000 -C 20000 --write-proxy \
> =(cat server1024.crt server1024.key dhe1024)
You may also want to look at
HAProxy configuration,
nginx configuration and
stunnel configuration.